
Web Harvesting: Using Generative AI for Structured Data Extraction




Ashwin V. Nikhar



Ashwin V. Nikhar "Web Harvesting: Using Generative AI for Structured Data Extraction" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Special Issue | Advancements and Emerging Trends in Computer Applications - Innovations, Challenges, and Future Prospects, March 2025, pp.230-234, URL: https://www.ijtsrd.com/papers/ijtsrd78442.pdf

Web data extraction is essential across many industries, but conventional scrapers often fail on dynamic webpages, JavaScript-rendered content, and bot protection. This paper proposes an adaptive web harvesting system that uses generative AI for structured data extraction in a changing online ecosystem. The system combines large language models (OpenAI GPT, Groq, Google Generative AI) with browser automation (Selenium with ChromeDriver) to interpret and adapt to complex webpage layouts dynamically. It pairs AI-based parsing with traditional libraries such as BeautifulSoup4, html2text, and readability-lxml to interpret HTML content, and uses diffusion models to reconstruct obscured elements. Pandas, Pydantic, and openpyxl handle data processing, while python-dotenv manages environment configuration. In addition, reinforcement learning agents mimic human-like interactions to optimize navigation and data retrieval. A Streamlit interface with streamlit-tags offers real-time data visualization and user feedback. Experimental tests on a variety of sites show that the AI-based approach substantially outperforms conventional scraping in adaptability, precision, and speed while meeting ethical and legal constraints on data extraction. The paper lays a foundation for future-proof web harvesting tools that are efficient and scalable.
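The core flow described above (reduce HTML to text, ask a language model for structured records, validate them, export a table) can be illustrated with a minimal sketch. This is not the paper's implementation. To stay self-contained it swaps in Python's standard library for the named tools: `html.parser` stands in for BeautifulSoup4/html2text, a dataclass for a Pydantic model, and `csv` for Pandas/openpyxl. The `Product` schema, the prompt wording, and `fake_llm` (a canned stand-in for a real model call) are all hypothetical.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from html.parser import HTMLParser
from typing import Callable, List


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce a page to plain text so the model sees content, not markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


@dataclass
class Product:
    """Hypothetical target schema; the paper does not specify one."""

    name: str
    price: float

    def __post_init__(self) -> None:
        # Minimal validation and coercion, in the spirit of a Pydantic model.
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        self.price = float(self.price)


def extract_records(html: str, llm: Callable[[str], str]) -> List[Product]:
    """Ask the model for JSON records and validate each one."""
    prompt = (
        "Return a JSON list of objects with keys 'name' and 'price'.\n\n"
        + html_to_text(html)
    )
    raw = json.loads(llm(prompt))
    return [Product(**item) for item in raw]


def to_csv(records: List[Product]) -> str:
    """Serialize validated records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): returns a canned answer."""
    return '[{"name": "Widget", "price": "9.99"}, {"name": "Gadget", "price": 24.5}]'
```

Calling `to_csv(extract_records(page_html, fake_llm))` runs the whole chain on a page already fetched as a string. In a real setup the `llm` argument would wrap a provider client and `page_html` would come from a browser session, for example Selenium's `driver.page_source`. Passing the model as a plain callable keeps the validation step testable without network access.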

Adaptive Web Harvesting, Generative AI, Large Language Models, AI-driven Web Parsing, Reinforcement Learning, Structured Data Extraction, Selenium, ChromeDriver, BeautifulSoup, Diffusion Models, Dynamic Web Scraping, Machine Learning, Ethical Web Scraping, Intelligent Web Crawling.


IJTSRD78442
Special Issue | Advancements and Emerging Trends in Computer Applications - Innovations, Challenges, and Future Prospects, March 2025
230-234
IJTSRD | www.ijtsrd.com | E-ISSN 2456-6470
Copyright © 2019 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)

